{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Following the analysis from 10/30, let's check if using the low confidence denovo calls from GATK helps with the consensus results. Also, let's ignore DenovoGear from now. According to the triodenovo paper, in very few (if any) cases, the calls from DNG are better than triodenovo's."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Within tools stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GATK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "suffix=allDeNovo.vcf \n",
    "cd /data/NCR_SBRB/simplex/gatk_refine\n",
    "rm interesting_snvs_${suffix}.txt\n",
    "# figure out all family IDs\n",
    "ls -1 *_trio1_${suffix} > famids.txt\n",
    "sed -i -e \"s/_trio1_${suffix}//g\" famids.txt\n",
    "\n",
    "# for each family ID\n",
    "while read fam; do\n",
    "  # figure out how many trios we have\n",
    "   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;\n",
    "   ntrios=$(($ntrios))\n",
    "   # if we have more than one (assuming the first one is affected)\n",
    "   if [ $ntrios -gt 1 ]; then\n",
    "      # get all SNVs in the affected trio in the family\n",
    "      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;\n",
    "      # combine the vcf files of all unnafected trios\n",
    "      cat ${fam}_trio[2..$ntrios]_${suffix} | grep -v '#' - | cut -f 1,2 - > ${fam}_control_snvs_${suffix}.txt;\n",
    "      # for each possible SNV in affected trio, mark it as interesting if it's not\n",
    "      # in the unnafected trios\n",
    "      while read snv; do\n",
    "         if ! grep -q \"$snv\" ${fam}_control_snvs_${suffix}.txt; then\n",
    "            echo $snv >> interesting_snvs_${suffix}.txt;\n",
    "         fi;\n",
    "      done < ${fam}_possible_snvs_${suffix}.txt;\n",
    "   fi;\n",
    "done < famids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from scipy import stats\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "import numpy as np\n",
    "def do_perms(snps, noften, nboot=10000, npicks=20):\n",
    "    success = 0\n",
    "    for i in range(nboot):\n",
    "        picks = np.random.choice(snps, npicks, replace = True)\n",
    "        counts = stats.itemfreq(picks)\n",
    "        nmax = np.max(counts[:,1])\n",
    "        if (nmax >= noften):\n",
    "            success += 1\n",
    "    return(success/float(nboot))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maximum frequency: 3 in 139467 unique snps\n",
      "[['chr1 120598739' 3]\n",
      " ['chr1 142897718' 3]\n",
      " ['chr1 143251223' 3]\n",
      " ['chr1 143420050' 3]\n",
      " ['chr1 143449316' 3]\n",
      " ['chr1 245928122' 3]\n",
      " ['chr10 133135993' 3]\n",
      " ['chr10 19026307' 3]\n",
      " ['chr11 51394527' 3]\n",
      " ['chr11 55046877' 3]\n",
      " ['chr12 118406089' 3]\n",
      " ['chr12 121471931' 3]\n",
      " ['chr12 9015996' 3]\n",
      " ['chr13 19447335' 3]\n",
      " ['chr13 52801661' 3]\n",
      " ['chr13 70469364' 3]\n",
      " ['chr13 73287599' 3]\n",
      " ['chr14 104630801' 3]\n",
      " ['chr14 20004233' 3]\n",
      " ['chr14 35476924' 3]\n",
      " ['chr14 66368478' 3]\n",
      " ['chr15 22049121' 3]\n",
      " ['chr15 78762313' 3]\n",
      " ['chr15 79220567' 3]\n",
      " ['chr16 30199933' 3]\n",
      " ['chr16 32141408' 3]\n",
      " ['chr16 33040759' 3]\n",
      " ['chr16 33865368' 3]\n",
      " ['chr16 33865369' 3]\n",
      " ['chr16 33865375' 3]\n",
      " ['chr16 33865377' 3]\n",
      " ['chr16 33937413' 3]\n",
      " ['chr16 55016333' 3]\n",
      " ['chr16 87747960' 3]\n",
      " ['chr17 21244764' 3]\n",
      " ['chr17 21248698' 3]\n",
      " ['chr17 25296420' 3]\n",
      " ['chr17 78795555' 3]\n",
      " ['chr17 80441236' 3]\n",
      " ['chr18 15172029' 3]\n",
      " ['chr18 34803023' 3]\n",
      " ['chr18 9880504' 3]\n",
      " ['chr19 53184007' 3]\n",
      " ['chr19 9573019' 3]\n",
      " ['chr2 132891778' 3]\n",
      " ['chr2 132891791' 3]\n",
      " ['chr2 162279815' 3]\n",
      " ['chr2 185767291' 3]\n",
      " ['chr2 90469196' 3]\n",
      " ['chr2 96606359' 3]\n",
      " ['chr20 3676780' 3]\n",
      " ['chr21 10010088' 3]\n",
      " ['chr21 10140686' 3]\n",
      " ['chr21 10530298' 3]\n",
      " ['chr21 10698859' 3]\n",
      " ['chr21 11062691' 3]\n",
      " ['chr21 11139941' 3]\n",
      " ['chr21 15344999' 3]\n",
      " ['chr21 15696265' 3]\n",
      " ['chr21 9897066' 3]\n",
      " ['chr22 25342392' 3]\n",
      " ['chr22 27383199' 3]\n",
      " ['chr22 32657767' 3]\n",
      " ['chr3 113127618' 3]\n",
      " ['chr3 125478931' 3]\n",
      " ['chr3 13372529' 3]\n",
      " ['chr3 21462544' 3]\n",
      " ['chr3 75693191' 3]\n",
      " ['chr3 75834866' 3]\n",
      " ['chr3 75834883' 3]\n",
      " ['chr3 75834885' 3]\n",
      " ['chr3 75834990' 3]\n",
      " ['chr4 190884774' 3]\n",
      " ['chr4 3589043' 3]\n",
      " ['chr4 36901539' 3]\n",
      " ['chr4 49103631' 3]\n",
      " ['chr4 49274823' 3]\n",
      " ['chr4 63374577' 3]\n",
      " ['chr4 72510' 3]\n",
      " ['chr5 25882059' 3]\n",
      " ['chr5 72856495' 3]\n",
      " ['chr5 76442187' 3]\n",
      " ['chr5 79086274' 3]\n",
      " ['chr5 9622117' 3]\n",
      " ['chr6 133071642' 3]\n",
      " ['chr6 137222387' 3]\n",
      " ['chr6 32496974' 3]\n",
      " ['chr6 36395761' 3]\n",
      " ['chr6 57406052' 3]\n",
      " ['chr6 57406054' 3]\n",
      " ['chr7 133160732' 3]\n",
      " ['chr7 143270465' 3]\n",
      " ['chr7 154002420' 3]\n",
      " ['chr7 157652301' 3]\n",
      " ['chr7 56891793' 3]\n",
      " ['chr7 57870746' 3]\n",
      " ['chr7 57870756' 3]\n",
      " ['chr7 57870770' 3]\n",
      " ['chr7 57870796' 3]\n",
      " ['chr7 76111297' 3]\n",
      " ['chr7 98030513' 3]\n",
      " ['chr8 39717511' 3]\n",
      " ['chr8 41486144' 3]\n",
      " ['chr8 74658583' 3]\n",
      " ['chr8 9789120' 3]\n",
      " ['chr9 10829830' 3]\n",
      " ['chr9 32985273' 3]\n",
      " ['chr9 66800814' 3]\n",
      " ['chr9 68401055' 3]\n",
      " ['chr9 69877791' 3]\n",
      " ['chrY 58980351' 3]\n",
      " ['chrY 59014761' 3]]\n"
     ]
    }
   ],
   "source": [
    "snps_gatk = pd.read_table('/data/NCR_SBRB/simplex/gatk_refine/interesting_snvs_allDeNovo.vcf.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_gatk = stats.itemfreq(snps_gatk['snp'])\n",
    "my_max = np.max(counts_gatk[:, 1])\n",
    "print 'Maximum frequency: %d in %d unique snps' % (my_max, len(counts_gatk))\n",
    "print counts_gatk[counts_gatk[:, 1] == my_max]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "do_perms(counts_gatk[:, 0], 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, now we have several SNPs with at least 3 families calling it interesting... that's better, even though not all of them are high confidence according to GATK. Still, there are a bit too many for further investigation. Let's see if any of them agree with Triodenovo's calls."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Triodenovo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "rm: cannot remove `interesting_snvs_denovo_v2.vcf.txt': No such file or directory\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "suffix=denovo_v2.vcf \n",
    "cd /data/NCR_SBRB/simplex/triodenovo\n",
    "rm interesting_snvs_${suffix}.txt\n",
    "# figure out all family IDs\n",
    "ls -1 *_trio1_${suffix} > famids.txt\n",
    "sed -i -e \"s/_trio1_${suffix}//g\" famids.txt\n",
    "\n",
    "# for each family ID\n",
    "while read fam; do\n",
    "  # figure out how many trios we have\n",
    "   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;\n",
    "   ntrios=$(($ntrios))\n",
    "   # if we have more than one (assuming the first one is affected)\n",
    "   if [ $ntrios -gt 1 ]; then\n",
    "      # get all SNVs in the affected trio in the family\n",
    "      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;\n",
    "      # combine the vcf files of all unnafected trios\n",
    "      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;\n",
    "      # for each possible SNV in affected trio, mark it as interesting if it's not\n",
    "      # in the unnafected trios\n",
    "      while read snv; do\n",
    "         if ! grep -q \"$snv\" ${fam}_control_snvs_${suffix}.txt; then\n",
    "            echo $snv >> interesting_snvs_${suffix}.txt;\n",
    "         fi;\n",
    "      done < ${fam}_possible_snvs_${suffix}.txt;\n",
    "   fi;\n",
    "done < famids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maximum frequency: 5 in 18802 unique snps\n",
      "[['chr1 16914580' 5]\n",
      " ['chr10 49319323' 5]\n",
      " ['chr12 69667893' 5]\n",
      " ['chrX 102973509' 5]\n",
      " ['chrX 118751076' 5]\n",
      " ['chrX 13397236' 5]\n",
      " ['chrX 135307049' 5]\n",
      " ['chrX 14599572' 5]\n",
      " ['chrX 152610294' 5]\n",
      " ['chrX 153008911' 5]\n",
      " ['chrX 153880181' 5]\n",
      " ['chrX 153904473' 5]\n",
      " ['chrX 154774663' 5]\n",
      " ['chrX 16876980' 5]\n",
      " ['chrX 38262808' 5]\n",
      " ['chrX 43601142' 5]\n",
      " ['chrX 96502650' 5]]\n"
     ]
    }
   ],
   "source": [
    "snps_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/interesting_snvs_denovo_v2.vcf.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_tdn = stats.itemfreq(snps_tdn['snp'])\n",
    "my_max = np.max(counts_tdn[:, 1])\n",
    "print 'Maximum frequency: %d in %d unique snps' % (my_max, len(counts_tdn))\n",
    "print counts_tdn[counts_tdn[:, 1] == my_max]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, do any of these SNPs with 5 families also show up in the GATK calls?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[]\n"
     ]
    }
   ],
   "source": [
    "best_tdn = counts_tdn[counts_tdn[:, 1] == 5, 0]\n",
    "best_gatk = counts_gatk[counts_gatk[:, 1] == 3, 0]\n",
    "joint = [s for s in best_tdn if s in best_gatk]\n",
    "print joint"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whomp whomp..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Within group stats\n",
    "\n",
    "The approach here is to calculate the best stats in all ADHD samples, and see what's the best we can do for a specific variable in non-affected siblings. For comparison, we can do it the other way around as well.\n",
    "\n",
    "## Triodenovo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "cd /data/NCR_SBRB/simplex/triodenovo/\n",
    "\n",
    "# concatenate all affected trios and extract the SNPs\n",
    "rm affected_snvs.txt unaffected_snvs.txt;\n",
    "for f in `ls *_trio1_denovo_v2.vcf`; do\n",
    "    cat $f | grep -v '#' - | awk 'BEGIN {FS=\"\\t\"; OFS=\":\"}; {print $1, $2}' - | sort | uniq >> affected_snvs.txt;\n",
    "done\n",
    "# need to be careful not to double count unaffected SNPs\n",
    "while read fam; do\n",
    "    if [ -e ${fam}_trio2_denovo_v2.vcf ]; then\n",
    "        cat ${fam}_trio[2..4]_denovo_v2.vcf | grep -v '#' - | awk 'BEGIN {FS=\"\\t\"; OFS=\":\"}; {print $1, $2}' - | sort | uniq >> unaffected_snvs.txt;\n",
    "    fi;\n",
    "done < famids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maximum frequency: 14 in 21 affected families\n",
      "[['chrX:41093413' 14]]\n",
      "Maximum frequency: 11 in 29 unaffected families\n",
      "[['chrX:100098355' 11]\n",
      " ['chrX:150573743' 11]\n",
      " ['chrX:150575385' 11]\n",
      " ['chrX:27479339' 11]\n",
      " ['chrX:9686187' 11]]\n"
     ]
    }
   ],
   "source": [
    "naff = 21\n",
    "nunaff = 29\n",
    "aff_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/affected_snvs.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_aff_tdn = stats.itemfreq(aff_tdn['snp'])\n",
    "my_max = np.max(counts_aff_tdn[:, 1])\n",
    "print 'Maximum frequency: %d in %d affected families' % (my_max, naff)\n",
    "print counts_aff_tdn[counts_aff_tdn[:, 1] == my_max]\n",
    "\n",
    "unaff_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/unaffected_snvs.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_unaff_tdn = stats.itemfreq(unaff_tdn['snp'])\n",
    "my_max = np.max(counts_unaff_tdn[:, 1])\n",
    "print 'Maximum frequency: %d in %d unaffected families' % (my_max, nunaff)\n",
    "print counts_unaff_tdn[counts_unaff_tdn[:, 1] == my_max]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, we found a mutation in 14 out of the 21 affected families, but we can also find mutations happening in 11 out of 29 unaffected families. So, 14 is not terribly impressive. But let's see how often this one mutation happens in unaffected famlies?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([9], dtype=object)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "counts_unaff_tdn[counts_unaff_tdn[:, 0] == 'chrX:41093413', 1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can potentially assign some significance to that... but what if we strict ourselves to autossomal chromosomes?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maximum frequency: 9 in 21 affected families\n",
      "[['chr13:42894052' 9]\n",
      " ['chr3:39556476' 9]\n",
      " ['chrM:3012' 9]]\n",
      "Maximum frequency: 8 in 29 unaffected families\n",
      "[['chr1:7889972' 8]\n",
      " ['chrM:3012' 8]]\n",
      "Counts chr13:42894052 shows up in unaffected: 4\n",
      "Counts chr3:39556476 shows up in unaffected: 7\n",
      "Counts chrM:3012 shows up in unaffected: 8\n"
     ]
    }
   ],
   "source": [
    "aff_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/affected_snvs.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_aff_tdn = stats.itemfreq(aff_tdn['snp'])\n",
    "unaff_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/unaffected_snvs.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_unaff_tdn = stats.itemfreq(unaff_tdn['snp'])\n",
    "\n",
    "keep_me = [s for s, snp in enumerate(counts_aff_tdn[:, 0]) if snp.find('chrX') < 0]\n",
    "counts_aff_tdn = counts_aff_tdn[keep_me, ]\n",
    "keep_me = [s for s, snp in enumerate(counts_unaff_tdn[:, 0]) if snp.find('chrX') < 0]\n",
    "counts_unaff_tdn = counts_unaff_tdn[keep_me, ]\n",
    "\n",
    "my_max = np.max(counts_aff_tdn[:, 1])\n",
    "max_str = counts_aff_tdn[counts_aff_tdn[:, 1] == my_max, 0]\n",
    "print 'Maximum frequency: %d in %d affected families' % (my_max, naff)\n",
    "print counts_aff_tdn[counts_aff_tdn[:, 1] == my_max]\n",
    "my_max = np.max(counts_unaff_tdn[:, 1])\n",
    "print 'Maximum frequency: %d in %d unaffected families' % (my_max, nunaff)\n",
    "print counts_unaff_tdn[counts_unaff_tdn[:, 1] == my_max]\n",
    "for s in max_str:\n",
    "    print 'Counts %s shows up in unaffected: %d' % (s, counts_unaff_tdn[counts_unaff_tdn[:, 0] == s, 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GATK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "cd /data/NCR_SBRB/simplex/gatk_refine\n",
    "\n",
    "# concatenate all affected trios and extract the SNPs\n",
    "rm affected_snvs.txt unaffected_snvs.txt;\n",
    "for f in `ls *_trio1_allDeNovo.vcf`; do\n",
    "    cat $f | grep -v '#' - | awk 'BEGIN {FS=\"\\t\"; OFS=\":\"}; {print $1, $2}' - | sort | uniq >> affected_snvs.txt;\n",
    "done\n",
    "for f in `ls *_trio[2..4]_allDeNovo.vcf`; do\n",
    "    cat $f | grep -v '#' - | awk 'BEGIN {FS=\"\\t\"; OFS=\":\"}; {print $1, $2}' - | sort | uniq >> unaffected_snvs.txt;\n",
    "done"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maximum frequency: 3 in 21 affected families\n",
      "Maximum frequency: 3 in 29 unaffected families\n",
      "Counts chr10:133135993 shows up in unaffected: 0\n",
      "Counts chr10:19026307 shows up in unaffected: 0\n",
      "Counts chr11:51394527 shows up in unaffected: 0\n",
      "Counts chr11:55046877 shows up in unaffected: 0\n",
      "Counts chr12:118406089 shows up in unaffected: 0\n",
      "Counts chr12:121471931 shows up in unaffected: 0\n",
      "Counts chr12:9015996 shows up in unaffected: 0\n",
      "Counts chr13:19447335 shows up in unaffected: 0\n",
      "Counts chr13:52801661 shows up in unaffected: 0\n",
      "Counts chr13:70469364 shows up in unaffected: 0\n",
      "Counts chr13:73287599 shows up in unaffected: 0\n",
      "Counts chr14:104630801 shows up in unaffected: 0\n",
      "Counts chr14:20004233 shows up in unaffected: 0\n",
      "Counts chr14:35476924 shows up in unaffected: 0\n",
      "Counts chr14:66368478 shows up in unaffected: 0\n",
      "Counts chr14:93118237 shows up in unaffected: 0\n",
      "Counts chr15:22049121 shows up in unaffected: 0\n",
      "Counts chr15:78762313 shows up in unaffected: 0\n",
      "Counts chr15:79220567 shows up in unaffected: 0\n",
      "Counts chr16:14974528 shows up in unaffected: 0\n",
      "Counts chr16:30199933 shows up in unaffected: 0\n",
      "Counts chr16:32141408 shows up in unaffected: 0\n",
      "Counts chr16:33040759 shows up in unaffected: 0\n",
      "Counts chr16:33865368 shows up in unaffected: 0\n",
      "Counts chr16:33865369 shows up in unaffected: 0\n",
      "Counts chr16:33865375 shows up in unaffected: 0\n",
      "Counts chr16:33865377 shows up in unaffected: 0\n",
      "Counts chr16:33937413 shows up in unaffected: 0\n",
      "Counts chr16:55016333 shows up in unaffected: 0\n",
      "Counts chr16:87747960 shows up in unaffected: 0\n",
      "Counts chr17:21244764 shows up in unaffected: 0\n",
      "Counts chr17:21248698 shows up in unaffected: 0\n",
      "Counts chr17:25296420 shows up in unaffected: 0\n",
      "Counts chr17:78795555 shows up in unaffected: 0\n",
      "Counts chr17:80441236 shows up in unaffected: 0\n",
      "Counts chr18:15172029 shows up in unaffected: 0\n",
      "Counts chr18:34803023 shows up in unaffected: 0\n",
      "Counts chr18:9431365 shows up in unaffected: 0\n",
      "Counts chr18:9880504 shows up in unaffected: 0\n",
      "Counts chr19:53184007 shows up in unaffected: 0\n",
      "Counts chr19:53899008 shows up in unaffected: 0\n",
      "Counts chr19:9573019 shows up in unaffected: 0\n",
      "Counts chr1:120598739 shows up in unaffected: 0\n",
      "Counts chr1:142897718 shows up in unaffected: 0\n",
      "Counts chr1:143251223 shows up in unaffected: 0\n",
      "Counts chr1:143420050 shows up in unaffected: 0\n",
      "Counts chr1:143449316 shows up in unaffected: 0\n",
      "Counts chr1:16389362 shows up in unaffected: 0\n",
      "Counts chr1:245928122 shows up in unaffected: 0\n",
      "Counts chr20:24608392 shows up in unaffected: 0\n",
      "Counts chr20:3676780 shows up in unaffected: 0\n",
      "Counts chr21:10010088 shows up in unaffected: 0\n",
      "Counts chr21:10140686 shows up in unaffected: 0\n",
      "Counts chr21:10530298 shows up in unaffected: 0\n",
      "Counts chr21:10698859 shows up in unaffected: 0\n",
      "Counts chr21:11062691 shows up in unaffected: 0\n",
      "Counts chr21:11084916 shows up in unaffected: 0\n",
      "Counts chr21:11139941 shows up in unaffected: 0\n",
      "Counts chr21:15344999 shows up in unaffected: 0\n",
      "Counts chr21:15431389 shows up in unaffected: 0\n",
      "Counts chr21:15696265 shows up in unaffected: 0\n",
      "Counts chr21:9897066 shows up in unaffected: 0\n",
      "Counts chr22:25342392 shows up in unaffected: 0\n",
      "Counts chr22:27383199 shows up in unaffected: 0\n",
      "Counts chr22:32657767 shows up in unaffected: 0\n",
      "Counts chr22:39526296 shows up in unaffected: 0\n",
      "Counts chr2:120620791 shows up in unaffected: 0\n",
      "Counts chr2:132891778 shows up in unaffected: 0\n",
      "Counts chr2:132891791 shows up in unaffected: 0\n",
      "Counts chr2:158731512 shows up in unaffected: 0\n",
      "Counts chr2:162279815 shows up in unaffected: 0\n",
      "Counts chr2:185767291 shows up in unaffected: 0\n",
      "Counts chr2:90469196 shows up in unaffected: 0\n",
      "Counts chr2:96606359 shows up in unaffected: 0\n",
      "Counts chr3:113127618 shows up in unaffected: 0\n",
      "Counts chr3:125478931 shows up in unaffected: 0\n",
      "Counts chr3:13372529 shows up in unaffected: 0\n",
      "Counts chr3:21462544 shows up in unaffected: 0\n",
      "Counts chr3:75693191 shows up in unaffected: 0\n",
      "Counts chr3:75699204 shows up in unaffected: 0\n",
      "Counts chr3:75775635 shows up in unaffected: 0\n",
      "Counts chr3:75775644 shows up in unaffected: 0\n",
      "Counts chr3:75794129 shows up in unaffected: 0\n",
      "Counts chr3:75794134 shows up in unaffected: 0\n",
      "Counts chr3:75794146 shows up in unaffected: 0\n",
      "Counts chr3:75834866 shows up in unaffected: 0\n",
      "Counts chr3:75834883 shows up in unaffected: 0\n",
      "Counts chr3:75834885 shows up in unaffected: 0\n",
      "Counts chr3:75834990 shows up in unaffected: 0\n",
      "Counts chr4:190884774 shows up in unaffected: 0\n",
      "Counts chr4:3589043 shows up in unaffected: 0\n",
      "Counts chr4:36901539 shows up in unaffected: 0\n",
      "Counts chr4:49103631 shows up in unaffected: 0\n",
      "Counts chr4:49274823 shows up in unaffected: 0\n",
      "Counts chr4:63374577 shows up in unaffected: 0\n",
      "Counts chr4:72510 shows up in unaffected: 0\n",
      "Counts chr5:25882059 shows up in unaffected: 0\n",
      "Counts chr5:6447774 shows up in unaffected: 0\n",
      "Counts chr5:72856495 shows up in unaffected: 0\n",
      "Counts chr5:76442187 shows up in unaffected: 0\n",
      "Counts chr5:79086274 shows up in unaffected: 0\n",
      "Counts chr5:9622117 shows up in unaffected: 0\n",
      "Counts chr6:133071642 shows up in unaffected: 0\n",
      "Counts chr6:137222387 shows up in unaffected: 0\n",
      "Counts chr6:32496974 shows up in unaffected: 0\n",
      "Counts chr6:32545806 shows up in unaffected: 0\n",
      "Counts chr6:36395761 shows up in unaffected: 0\n",
      "Counts chr6:57406052 shows up in unaffected: 0\n",
      "Counts chr6:57406054 shows up in unaffected: 0\n",
      "Counts chr7:133160732 shows up in unaffected: 0\n",
      "Counts chr7:143270465 shows up in unaffected: 0\n",
      "Counts chr7:154002420 shows up in unaffected: 0\n",
      "Counts chr7:157652301 shows up in unaffected: 0\n",
      "Counts chr7:56891793 shows up in unaffected: 0\n",
      "Counts chr7:57870746 shows up in unaffected: 0\n",
      "Counts chr7:57870756 shows up in unaffected: 0\n",
      "Counts chr7:57870770 shows up in unaffected: 0\n",
      "Counts chr7:57870796 shows up in unaffected: 0\n",
      "Counts chr7:76111297 shows up in unaffected: 0\n",
      "Counts chr7:98030513 shows up in unaffected: 0\n",
      "Counts chr8:39717511 shows up in unaffected: 0\n",
      "Counts chr8:41486144 shows up in unaffected: 0\n",
      "Counts chr8:74658583 shows up in unaffected: 0\n",
      "Counts chr8:9789120 shows up in unaffected: 0\n",
      "Counts chr9:10829830 shows up in unaffected: 0\n",
      "Counts chr9:32985273 shows up in unaffected: 0\n",
      "Counts chr9:43012005 shows up in unaffected: 0\n",
      "Counts chr9:66800814 shows up in unaffected: 0\n",
      "Counts chr9:68401055 shows up in unaffected: 0\n",
      "Counts chr9:69877791 shows up in unaffected: 0\n",
      "Counts chrY:58980351 shows up in unaffected: 0\n",
      "Counts chrY:59014761 shows up in unaffected: 0\n"
     ]
    }
   ],
   "source": [
    "naff = 21\n",
    "nunaff = 29\n",
    "aff_gatk = pd.read_table('/data/NCR_SBRB/simplex/gatk_refine/affected_snvs.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_aff_gatk = stats.itemfreq(aff_gatk['snp'])\n",
    "my_max = np.max(counts_aff_gatk[:, 1])\n",
    "max_str = counts_aff_gatk[counts_aff_gatk[:, 1] == my_max, 0]\n",
    "print 'Maximum frequency: %d in %d affected families' % (my_max, naff)\n",
    "unaff_gatk = pd.read_table('/data/NCR_SBRB/simplex/gatk_refine/unaffected_snvs.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_unaff_gatk = stats.itemfreq(unaff_gatk['snp'])\n",
    "my_max = np.max(counts_unaff_gatk[:, 1])\n",
    "print 'Maximum frequency: %d in %d unaffected families' % (my_max, nunaff)\n",
    "for s in max_str:\n",
    "    if s in counts_unaff_gatk[:, 0]:\n",
    "        mycount = counts_unaff_gatk[counts_unaff_gatk[:, 0] == s, 1]\n",
    "    else:\n",
    "        mycount = 0\n",
    "    print 'Counts %s shows up in unaffected: %d' % (s, mycount)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also not good..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Some numbers\n",
    "\n",
    "First, how many DNVs do we get in each trio, per tool?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TrioDenovo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10033_trio1_denovo_v2.vcf 1188\n",
      "10033_trio2_denovo_v2.vcf 2353\n",
      "10042_trio1_denovo_v2.vcf 2399\n",
      "10090_trio1_denovo_v2.vcf 1884\n",
      "10090_trio2_denovo_v2.vcf 1712\n",
      "10094_trio1_denovo_v2.vcf 1895\n",
      "10094_trio2_denovo_v2.vcf 1846\n",
      "10128_trio1_denovo_v2.vcf 2201\n",
      "10128_trio2_denovo_v2.vcf 2157\n",
      "10131_trio1_denovo_v2.vcf 2033\n",
      "10131_trio2_denovo_v2.vcf 1125\n",
      "10131_trio3_denovo_v2.vcf 1164\n",
      "10131_trio4_denovo_v2.vcf 1333\n",
      "10153_trio1_denovo_v2.vcf 1119\n",
      "10153_trio2_denovo_v2.vcf 1912\n",
      "10153_trio3_denovo_v2.vcf 1007\n",
      "10164_trio1_denovo_v2.vcf 2141\n",
      "10164_trio2_denovo_v2.vcf 1869\n",
      "10173_trio1_denovo_v2.vcf 1897\n",
      "10173_trio2_denovo_v2.vcf 1916\n",
      "10178_trio1_denovo_v2.vcf 2577\n",
      "10178_trio2_denovo_v2.vcf 2423\n",
      "10182_trio1_denovo_v2.vcf 2347\n",
      "10182_trio2_denovo_v2.vcf 1333\n",
      "10182_trio3_denovo_v2.vcf 1830\n",
      "10197_trio1_denovo_v2.vcf 1281\n",
      "10197_trio2_denovo_v2.vcf 1179\n",
      "10215_trio1_denovo_v2.vcf 2311\n",
      "10215_trio2_denovo_v2.vcf 2270\n",
      "10215_trio3_denovo_v2.vcf 1122\n",
      "10215_trio4_denovo_v2.vcf 1177\n",
      "10369_trio1_denovo_v2.vcf 600\n",
      "10369_trio2_denovo_v2.vcf 628\n",
      "10406_trio1_denovo_v2.vcf 1560\n",
      "10406_trio2_denovo_v2.vcf 853\n",
      "10406_trio3_denovo_v2.vcf 1673\n",
      "10448_trio1_denovo_v2.vcf 1870\n",
      "10448_trio2_denovo_v2.vcf 1737\n",
      "10459_trio2_denovo_v2.vcf 1914\n",
      "1892_trio1_denovo_v2.vcf 1016\n",
      "1892_trio2_denovo_v2.vcf 941\n",
      "1893_trio1_denovo_v2.vcf 2816\n",
      "1893_trio2_denovo_v2.vcf 2607\n",
      "1895_trio1_denovo_v2.vcf 2016\n",
      "1895_trio2_denovo_v2.vcf 2080\n",
      "1976_trio1_denovo_v2.vcf 2164\n",
      "1976_trio2_denovo_v2.vcf 2225\n",
      "1976_trio3_denovo_v2.vcf 1126\n",
      "855_trio1_denovo_v2.vcf 1324\n",
      "855_trio2_denovo_v2.vcf 1940\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "\n",
    "cd /data/NCR_SBRB/simplex/triodenovo\n",
    "for f in `ls *_trio?_denovo_v2.vcf`; do\n",
    "    echo $f `grep -v '#' ${f} | sort | uniq | wc -l`;\n",
    "done"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GATK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10033_trio1_allDeNovo.vcf 3702\n",
      "10033_trio2_allDeNovo.vcf 6858\n",
      "10042_trio1_allDeNovo.vcf 4431\n",
      "10090_trio1_allDeNovo.vcf 4607\n",
      "10090_trio2_allDeNovo.vcf 3896\n",
      "10094_trio1_allDeNovo.vcf 4511\n",
      "10094_trio2_allDeNovo.vcf 3632\n",
      "10128_trio1_allDeNovo.vcf 3707\n",
      "10128_trio2_allDeNovo.vcf 3098\n",
      "10131_trio1_allDeNovo.vcf 3571\n",
      "10131_trio2_allDeNovo.vcf 3924\n",
      "10131_trio3_allDeNovo.vcf 3078\n",
      "10131_trio4_allDeNovo.vcf 4552\n",
      "10153_trio1_allDeNovo.vcf 4184\n",
      "10153_trio2_allDeNovo.vcf 4603\n",
      "10153_trio3_allDeNovo.vcf 3912\n",
      "10164_trio1_allDeNovo.vcf 5132\n",
      "10164_trio2_allDeNovo.vcf 3549\n",
      "10173_trio1_allDeNovo.vcf 4318\n",
      "10173_trio2_allDeNovo.vcf 3454\n",
      "10178_trio1_allDeNovo.vcf 9729\n",
      "10178_trio2_allDeNovo.vcf 6656\n",
      "10182_trio1_allDeNovo.vcf 5257\n",
      "10182_trio2_allDeNovo.vcf 4141\n",
      "10182_trio3_allDeNovo.vcf 3044\n",
      "10197_trio1_allDeNovo.vcf 7505\n",
      "10197_trio2_allDeNovo.vcf 5022\n",
      "10215_trio1_allDeNovo.vcf 6976\n",
      "10215_trio2_allDeNovo.vcf 4085\n",
      "10215_trio3_allDeNovo.vcf 3993\n",
      "10215_trio4_allDeNovo.vcf 5384\n",
      "10369_trio1_allDeNovo.vcf 58406\n",
      "10369_trio2_allDeNovo.vcf 60085\n",
      "10406_trio1_allDeNovo.vcf 3740\n",
      "10406_trio2_allDeNovo.vcf 2672\n",
      "10406_trio3_allDeNovo.vcf 3697\n",
      "10448_trio1_allDeNovo.vcf 5537\n",
      "10448_trio2_allDeNovo.vcf 4782\n",
      "10459_trio2_allDeNovo.vcf 5381\n",
      "1892_trio1_allDeNovo.vcf 3122\n",
      "1892_trio2_allDeNovo.vcf 2466\n",
      "1893_trio1_allDeNovo.vcf 6050\n",
      "1893_trio2_allDeNovo.vcf 6802\n",
      "1895_trio1_allDeNovo.vcf 3922\n",
      "1895_trio2_allDeNovo.vcf 3218\n",
      "1976_trio1_allDeNovo.vcf 6067\n",
      "1976_trio2_allDeNovo.vcf 6792\n",
      "1976_trio3_allDeNovo.vcf 4826\n",
      "855_trio1_allDeNovo.vcf 5250\n",
      "855_trio2_allDeNovo.vcf 4236\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "\n",
    "cd /data/NCR_SBRB/simplex/gatk_refine\n",
    "for f in `ls *_trio?_allDeNovo.vcf`; do\n",
    "    echo $f `grep -v '#' ${f} | sort | uniq | wc -l`;\n",
    "done"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many of those are only show up in the affected trio?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TrioDenovo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10033 956\n",
      "10090 1238\n",
      "10094 1052\n",
      "10128 1080\n",
      "10131 1707\n",
      "10153 760\n",
      "10164 1515\n",
      "10173 1105\n",
      "10178 1720\n",
      "10182 1105\n",
      "10197 1078\n",
      "10215 960\n",
      "10369 402\n",
      "10406 826\n",
      "10448 1164\n",
      "1892 823\n",
      "1893 1625\n",
      "1895 1209\n",
      "1976 1150\n",
      "855 1075\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "suffix=denovo_v2.vcf \n",
    "cd /data/NCR_SBRB/simplex/triodenovo\n",
    "rm interesting_snvs_${suffix}.txt\n",
    "# figure out all family IDs\n",
    "ls -1 *_trio1_${suffix} > famids.txt\n",
    "sed -i -e \"s/_trio1_${suffix}//g\" famids.txt\n",
    "\n",
    "# for each family ID\n",
    "while read fam; do\n",
    "  # figure out how many trios we have\n",
    "   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;\n",
    "   ntrios=$(($ntrios))\n",
    "   # if we have more than one (assuming the first one is affected)\n",
    "   if [ $ntrios -gt 1 ]; then\n",
    "      # get all SNVs in the affected trio in the family\n",
    "      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;\n",
    "      # combine the vcf files of all unnafected trios\n",
    "      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;\n",
    "      # for each possible SNV in affected trio, mark it as interesting if it's not\n",
    "      # in the unnafected trios\n",
    "      while read snv; do\n",
    "         if ! grep -q \"$snv\" ${fam}_control_snvs_${suffix}.txt; then\n",
    "            echo $snv >> interesting_snvs_${suffix}.txt;\n",
    "         fi;\n",
    "      done < ${fam}_possible_snvs_${suffix}.txt;\n",
    "      echo $fam `cat interesting_snvs_${suffix}.txt | wc -l`;\n",
    "      rm interesting_snvs_${suffix}.txt\n",
    "   fi;\n",
    "done < famids.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GATK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10033 3491\n",
      "10090 4436\n",
      "10094 4340\n",
      "10128 3590\n",
      "10131 3282\n",
      "10153 3862\n",
      "10164 4864\n",
      "10173 4119\n",
      "10178 8915\n",
      "10182 4784\n",
      "10197 6984\n",
      "10215 6096\n",
      "10369 52928\n",
      "10406 3401\n",
      "10448 5365\n",
      "1892 2997\n",
      "1893 5727\n",
      "1895 3755\n",
      "1976 5628\n",
      "855 5088\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "suffix=allDeNovo.vcf \n",
    "cd /data/NCR_SBRB/simplex/gatk_refine\n",
    "rm interesting_snvs_${suffix}.txt\n",
    "# figure out all family IDs\n",
    "ls -1 *_trio1_${suffix} > famids.txt\n",
    "sed -i -e \"s/_trio1_${suffix}//g\" famids.txt\n",
    "\n",
    "# for each family ID\n",
    "while read fam; do\n",
    "  # figure out how many trios we have\n",
    "   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;\n",
    "   ntrios=$(($ntrios))\n",
    "   # if we have more than one (assuming the first one is affected)\n",
    "   if [ $ntrios -gt 1 ]; then\n",
    "      # get all SNVs in the affected trio in the family\n",
    "      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;\n",
    "      # combine the vcf files of all unnafected trios\n",
    "      cat ${fam}_trio[2..$ntrios]_${suffix} | grep -v '#' - | cut -f 1,2 -> ${fam}_control_snvs_${suffix}.txt;\n",
    "      # for each possible SNV in affected trio, mark it as interesting if it's not\n",
    "      # in the unnafected trios\n",
    "      while read snv; do\n",
    "         if ! grep -q \"$snv\" ${fam}_control_snvs_${suffix}.txt; then\n",
    "            echo $snv >> interesting_snvs_${suffix}.txt;\n",
    "         fi;\n",
    "      done < ${fam}_possible_snvs_${suffix}.txt;\n",
    "      echo $fam `cat interesting_snvs_${suffix}.txt | wc -l`;\n",
    "      rm interesting_snvs_${suffix}.txt\n",
    "   fi;\n",
    "done < famids.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now find a few intersections. First, for all called SNPs:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, intersections for the SNPs in affected sibs only:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* GET NUMBER OF POSSIBLE DE NOVO PER FAMILY\n",
    "* MAKE VENN DIAGRAMS?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:py2.7.10]",
   "language": "python",
   "name": "conda-env-py2.7.10-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  },
  "nav_menu": {},
  "toc": {
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 6,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
